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By Bert Scalzo, Ph.D. EY client site | visit is implementing a data warehouse. Like many other IS 
technologies, it’s the Current rage, SO everyone is doing It. Unfortunately, the 
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situation reminds me of the teenage sex joke: Everyone claims to be doing It, 


yet no one is doing it quite right. 


Data warehouses are radically different ani- 
mals than on-line transaction processing 
(OLTP) systems. Bill Inmon’s books, A 
Data Architecture: The Information Paradigm 
[QED Information Sciences, 1989] and 
Using Oracle to Build Decision Support 
Systems [QED Information Sciences, 1990] 
articulate these differences and make excel- 
lent references for those undertaking the 


design of a data warehouse. 


In this article, I focus on one extremely 
important data warehouse design tech- 
nique: the partitioning of data. 


Successful data partitioning can yield sub- 
stantial performance improvements. 
Although Oracle-specific, these techniques 
on data partitioning may be applicable to 
other relational databases. 


Data warehouses can be enormous, often 
measuring hundreds of gigabytes. DBAs 
confronted with large data warehouse 
tables containing tens or hundreds of mil- 
lions of rows face a major design challenge: 
How can access time be kept reasonable on 
such large tables? 


The obvious answer is to partition the 
data. That is, to separate similar rows by 
simple criteria such as month, year, prod- 
uct line, geographic area, organizational 
unit, etc. 


As Inmon notes, however, there are at least 
two levels at which data partitioning can 
be implemented: the application level, and 
the Oracle level. We'll investigate these two 
options, especially the methods within the 
Oracle level. 


Application-Level Data Partitioning 

The first partitioning approach is to split a 
large table into several smaller ones. For 
example, if the Customers table is too big, 
then split it into four smaller tables based 
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upon region: north, south, 
east, and west. In addition, 
most designers would 
include a view definition 
that is the UNION of the 
partitioned data sets. Thus, 
a customer view that is the 


UNION of north, south, 


east, and west. 


This approach has the bene- 
fit of conceptual simplicity; 
everyone from DBAs to pro- 
grammers clearly under- 
stands this technique. All 
are comfortable with this 
simple solution and its rami- SF 
fications for their participa- Figure 1. 
tion. Moreover, no one 


questions the viability of such a technique. 


The universal belief seems to be that many smaller tables 
will, in fact, perform much better than one large table. In 
addition, Inmon states, “As a rule, there is much more 
flexibility in doing partitioning of the data at the applica- 
tion level.” As such, this one technique is almost univer- 
sally adopted in the construction of data warehouses; near- 
ly everyone is using it. 


However, application-level data partitioning is, at best, a 
“brute force” approach. Yes, it’s simple, and yes, it does 
partition the data. 


But at what cost? General Custer’s loss to the Sioux at 
Little Big Horn was a direct result of the partitioning of 
his army. He had other strategic options that may have 
yielded different results, possibly even a victory. But 
Custer chose a simple and since-debated military strategy: 
divide and conquer. The rest is history, not to mention 
numerous B westerns. 


As such, I refer to the simple adoption of application-level 
data partitioning as the “Custer Mistake.” By simple adop- 
tion, I mean all those designers who choose this technique 
based upon “gut feeling” instead of empirical evidence. 


This design decision is much too important to base upon 
intuition — hard, factual information is required in assess- 
ing this approach and its viability. Otherwise, the resulting 
data warehouse’s performance may not meet expectations. 


While “divide and conquer” is a wonderful concept that 
makes algorithms such as quick sort efficient, it fails miser- 
ably when applied blindly for a data warehouse design. By 
adopting this approach, the designer is in essence treating the 
obvious symptom, numerous rows, instead of the underlying 
disease, random distribution. 





Figure 1 (Left): The Customers table. Its rows are randomly distributed. Figure 2 (Right): Two 
smaller tables, East and West, that contain the horizontally partitioned data from Customers in 
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Figure 3: The result statistics from Explain SQL's test on the 
Customers table. 


Figure 1 displays a single table called Customers with rows 

that are randomly distributed. Figure 2 displays two smaller 
tables, called East and West, that are the horizontally parti- 
tioned data from the Customer table. 


The problem is that the data is still randomly distributed. 
Thus, to retrieve all rows of type A or C, spanning the 
partition will require as much physical I/O as before the 
split. Additionally, to retrieve all rows of type A or B with- 
in a partition would still require significant I/O. 


The point is that reducing the domain's universe does not 
necessarily change the nature of the data distribution. 


I used SQL Tools Inc.’s Explain SQL 1.1.3 to capture the 
statistics of a SELECT on all customers in a particular zip 
code. I tested a single-table query versus a four-table hori- 
zontal partitioning: north, south, east, and west. The 
results are shown in Figures 3 and 4. Note that the CPU 
time, logical I/O, and physical I/O statistics are practically 
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Figure 4: These test results are for the East and West tables 
shown in Figure 2. As you can see, the results in Figure 3 
and the ones shown here are nearly identical, not revealing 
any real improvement in performance. 


identical. So much for multiple, smaller tables being any 
better or faster! 


Statistics also revealed that querying the view definition 
was also much slower than the same query on the single 
table. Furthermore if a simple UNION was used to con- 
struct the view instead of UNION ALL, the performance 
contrast was substantial because of UNION’ sorting. 

In general, application-level data partitioning does not yield 
the desired performance results. 


Oracle-Level Data Partitioning 

There are at least two methods of partitioning at the Oracle 
level: striping and clustering. Striping is defined by the 
Oracle7 Server Administrators Guide as “the practice of divid- 
ing a large table’s data into small portions and storing these 
portions in separate data files on separate disks.” This is done 
to distribute disk I/O and thereby improve performance. 


In essence, striping is a physical implementation of data 
partitioning, performed internally by Oracle. Striping is 
easily understood and furthermore, easier to set up. 
However, it suffers the same problem as application-level 
data partitioning: it does nothing to control the random 
distribution of data. 


In fact, one can argue that it yields even less control than 
manually splitting the table. For a striped table, Oracle 
merely allocates storage units known as extents on a first 
available basis from the table space’s free list. And that free 
list is not cognizant of striping — it’s merely a linked list 
of the table space’s free extents. 


In fact, unless you manually load the table’s data in the 
desired pre-sorted order, striping would do nothing to 
improve the distribution problem. Furthermore, even if the 
data is loaded such that it is sorted correctly, striping will 
not necessarily maintain that ordering over time. 


Finally, Oracle’s Installation Guide states that operating 
systems which inherently provide striping generally yield 
better results than manual striping (e.g. Novell’s NetWare 
using logical volumes and Hewlett-Packard’s HP-UX with 
Volume Manager). Hence, manual striping is not our 
solution either. 


Fiction and Fact: Clustering 

Clustering, the other Oracle-level method for partitioning 
data, is one of the most misunderstood and under used data- 
base features. Many people have never heard of clusters, and 
still more have judged them as unusable without reason. 


Not too long ago when I worked for Oracle, there was a flur- 
ry of internal e-mail regarding the use of clusters. It amazed 
me to learn the number of Oracle consultants and instructors 
who did not understand nor prescribe the use of clusters. 


More alarmingly, there was a general tendency for “junior” 
people to blindly subscribe to the opinions of “senior” 
people. Nearly everyone within Oracle universally labeled 
clusters as useless, and then attempted to make each other 
feel good about this position. In addition, otherwise good 
Oracle reference books include condemning statements 
regarding indexed clusters. 


With such pervasive misinformation and ignorance, it’s no 
wonder that clusters have seen such little use. And that’s a 
shame, because clusters truly offer the key to successful 
data partitioning, especially for data warehouses. 


Those who are familiar with clustering generally know 
Oracle’s multi-table clustering capability. In this form of 
clustering, Oracle stores rows physically close together 
from multiple tables. Specifically, it stores a parent row 
from table A and all its related child rows from table B 
within the same Oracle block. I often refer to this as stor- 
ing these tables “pre-joined.” 


Figure 5 demonstrates the clustering of two tables: 
Departments and their associated Employees. To the applica- 
tion developer, two separate tables exist and they write DML 
against the tables using JOINs. They are entirely shielded 
from the physical storage of the data (as per Codd’s relational 
rule for physical independence). 


However, since Oracle has internally placed parent and 
child records together within the same block, JOIN opera- 


tions are much quicker as total required I/O is reduced. 


Under ideal circumstances, these performance improvements 
can be quite substantial. But in general, multi-table clustering 
has only limited practicality and still does not solve our under- 
lying problem regarding the distribution of data for large tables. 


Of those who do know of multi-table clustering, many don’t 
realize that Oracle can also cluster upon a single table, nor 
know why that technique is so useful. But it’s exactly this 
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Figure 5 (Left): A database that contains information from the Departments table and the 
Employees table. Figure 6 (Right): Storage of data in the Customers table if the cluster index was 


on the column identifying the record type. 
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Figure 7: Indexed single-table clustering provides a great 
improvement in data distribution performance for the Customers 
table. 
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Figure 8: Likewise, this test with Explain SQL reveal improved 
performance results with hashed, single-table clustering on the 
Customers table. 


clustering on a single table that offers a solution to our data 
distribution problem. 





In single-table clustering, 
rows of data from one table 
are stored physically close 
together based upon the 
value of the cluster index. 
Specifically, rows from table 
A having the same value are 
stored within the same 
Oracle block. I often refer 
to this as storing a table 


“pre-ordered by.” 


Figure 6 exhibits how our 
earlier Customers table 
would be physically stored if 
the cluster index was on the 
column that identifies the 
record type (i.e. on the col- 
umn designating a row as either an A, B, C, or D). Notice 
how the rows are placed sorted on the physical media. 


Moreover, this sorted order is maintained over time. 
Queries accessing major groupings, such as retrieving all 
‘A’s and ‘B’s, will locate the data very quickly as the physi- 
cal media’s read/write arm will have to perform minimal 
amounts of movement. Thus, single-table clustering is the 
solution to our data distribution problem for a single, 


large table. 


Figures 7 and 8 show the statistics for both indexed and 
hashed, single-table clustering of our original, large 
Customers table, sorted by postal zip code. Note that the 
limited test data used resulted in a logical I/O reduction of 
44 percent and an execution time reduction of six percent. 
Actual project data with over 50 million customer records 
yielded 33 percent and 25 percent, respectively. 


Conclusion 

Now that we've separated fact from fiction, you can rely 
on empirical evidence, rather than just “gut feelings,” to 
decide if clustering can help you improve your data ware- 
house performance. Cl 
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